ABSTRACT


Correlation is often investigated (and tested for significance)
in situations where some of the observations on one of the variables
are missing. Throwing away these unpaired observations may seem to
be a waste of information; a test based on all the data at hand would
seemingly be better than a test based on only some of the data
available. An exact test using all the data, which is similar in form and
distribution to the usual t test based on the sample correlation
coefficient, is derived and examined. However, this exact test proves
to be a relatively inefficient way to incorporate the extra
information. This counterintuitive result provides an interesting lesson
concerning the relationship between power and degrees of freedom.

I. INTRODUCTION


Estimating or testing the correlation between two variables is
a common problem with applications reaching into virtually every
subject area that makes use of the science of statistics. In many
of these diverse applications, data on one or more of the variables
of interest may be lost, missing, or unobtainable for some of the
subjects. In a medical setting where physiological measurements are
to be obtained, subjects dying, instruments malfunctioning, and other
random miscellaneous situations (the occurrence of which should in
no way be related to any of the variables or treatments) do lead to
a missing data problem.

Practitioners of statistics since Wilks (1932) have realized
that there is additional information in the unpaired observations.
The question is how to properly and efficiently use this extra
information such that the resulting test would be more powerful than the
standard t test based exclusively on the paired observations. Herein,
a t test (with greater degrees of freedom) is derived, but in spite of
the similarity in form, the t test with more degrees of freedom is in
fact inferior with respect to power for some alternatives.

We have found this result to be pedagogically valuable on several
counts. First, the beginning student can be led down the primrose
path for a greater impact of the point that one's intuition about a
reasonable way to do things may not be infallible. Second, we consider
it an important general lesson that one must take care not to misuse
information, thinking that this is in some way preferable to not
using it at all. Finally, this new statistic is an exception to the
rule that more degrees of freedom yield more power when comparing
two exact tests of similar form.

This new test is derived in much the same way as the usual t
test for correlation based only on the paired observations. Consider
$\{(x_i, y_i)\}$, $i = 1, \ldots, m+n$, to be a random sample of size
$m+n$ from a bivariate normal distribution with parameters $\mu_x$,
$\mu_y$, $\sigma_x^2$, $\sigma_y^2$, and $\rho$. Suppose $m$ of the
$x$ values are missing (or that we have $m$ "extra" $y$ values) and
that we would like to test the null hypothesis $H_0: \rho = 0$
versus $H_1: \rho > 0$. Rearranging the indices for convenience we
obtain:

$$
\begin{array}{l}
x_1, x_2, \ldots, x_n \\
y_1, y_2, y_3, \ldots, y_n, y_{n+1}, \ldots, y_{n+m}.
\end{array}
$$


The existence of an exact test based on the sample product
moment correlation coefficient is well known, using

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}
         {\left[ \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 \sum_{i=1}^{n} (y_i - \bar{y}_n)^2 \right]^{1/2}},
$$

where $\bar{x}_n = \sum_{i=1}^{n} x_i / n$ and $\bar{y}_n = \sum_{i=1}^{n} y_i / n$,
if there are an equal number of x's and y's, that is, bivariate
observations.

In the case at hand $(n+m)$ paired observations are not available.
Hence one way to test $H_0$ is to discard the additional unpaired
observations. But to do so would be discarding some information. Since
the unpaired y's do give information about some of the parameters, it
would seem reasonable that we should use these y's in some fashion.

Three tests are investigated. The first one is an exact test.
The second one is based on the maximum likelihood estimate of $\rho$ and
the third one is based on the generalized likelihood ratio. It will
be shown that the generalized likelihood ratio test does not depend
on the additional information.
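As a point of reference for the comparisons in the later sections, the usual paired-data computation can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `pearson_r` and `t_paired` and the small data set are hypothetical and do not come from the paper. The statistic is $t = r(n-2)^{1/2}/(1-r^2)^{1/2}$, referred to Student t with $n-2$ degrees of freedom.

```python
import math

def pearson_r(x, y):
    """Sample product moment correlation for paired lists x and y."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def t_paired(x, y):
    """Usual t statistic for H0: rho = 0 (n - 2 degrees of freedom)."""
    n = len(x)
    r = pearson_r(x, y)
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# Hypothetical illustrative data, not taken from the paper.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(pearson_r(x, y), t_paired(x, y))
```

Any unpaired y values are simply left out of `x` and `y` here, which is exactly the "discard" strategy that the rest of the paper uses as its benchmark.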


II. AN EXACT TEST

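Let us now examine a test statistic which uses all of the data
and is similar in form and distribution to that of the familiar test
procedure for complete paired samples. The temptation is to accommodate
the extra observations of y in a straightforward manner by defining

$$
r^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}
           {\left[ \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 \sum_{i=1}^{n+m} (y_i - \bar{y}_{n+m})^2 \right]^{1/2}},
$$

where $\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ and
$\bar{y}_{n+m} = \frac{1}{n+m} \sum_{i=1}^{n+m} y_i$.

We may consider r* as being derived from the following naive estimator
of $\rho$:

$$
\tilde{\rho} = \frac{\tilde{\sigma}_{xy}}{\tilde{\sigma}_x \tilde{\sigma}_y}
= \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}
       {\left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 \cdot \frac{1}{n+m} \sum_{i=1}^{n+m} (y_i - \bar{y}_{n+m})^2 \right]^{1/2}}
= [(n+m)/n]^{1/2} \, r^* .
$$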
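The relationship between the naive estimator and r* can be checked numerically. The sketch below assumes the explicit sum-of-products forms, with the paired cross-product in the numerator and all $n+m$ values of y in the y sum of squares. The function names and the data are hypothetical illustrations, not part of the original text.

```python
import math

def _sums(x, y_all):
    """Cross-product and sums of squares; y_all[:n] is paired with x."""
    n, total = len(x), len(y_all)
    x_bar = sum(x) / n
    y_bar_n = sum(y_all[:n]) / n          # mean of the paired y's
    y_bar_all = sum(y_all) / total        # mean of all n + m y's
    s_xy = sum((a - x_bar) * (b - y_bar_n) for a, b in zip(x, y_all[:n]))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy_all = sum((b - y_bar_all) ** 2 for b in y_all)
    return n, total, s_xy, s_xx, s_yy_all

def r_star(x, y_all):
    """r*: paired cross-product over sums of squares using every y."""
    _, _, s_xy, s_xx, s_yy_all = _sums(x, y_all)
    return s_xy / math.sqrt(s_xx * s_yy_all)

def rho_naive(x, y_all):
    """Naive estimator: each moment divided by its own sample size."""
    n, total, s_xy, s_xx, s_yy_all = _sums(x, y_all)
    return (s_xy / n) / math.sqrt((s_xx / n) * (s_yy_all / total))

# Hypothetical data: n = 5 pairs plus m = 3 extra y values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_all = [2.0, 1.0, 4.0, 3.0, 5.0, 0.0, 6.0, 3.0]
n, total = len(x), len(y_all)
print(r_star(x, y_all), rho_naive(x, y_all),
      math.sqrt(total / n) * r_star(x, y_all))
```

The last two printed values should agree, reflecting the factor $[(n+m)/n]^{1/2}$. Because that factor exceeds one, the naive estimator is not guaranteed to stay inside $[-1, 1]$.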


We note that these individual estimators do not all coincide with the
maximum likelihood estimators which were given explicitly in this case
by Anderson (1957); in fact the maximum likelihood estimator of $\rho$ is
seen to be something quite different from $\tilde{\rho}$. However, the
following theorem concerning the exact null distribution of a test
statistic based on r* is quite similar to the result for r, the MLE in
the complete sample case.

Theorem

Under the hypothesis $H_0: \rho = 0$,
$t^* = r^* (n+m-2)^{1/2} / (1 - r^{*2})^{1/2}$ is
distributed as Student t with $n+m-2$ degrees of freedom.
Proof: Define two $(n+m) \times 1$ vectors $Z_1$ and $Z_2$ with elements

$$
z_{1i} = \begin{cases}
x_i - \bar{x}_n & \text{if } i = 1, 2, \ldots, n \\
0 & \text{if } i = n+1, \ldots, n+m
\end{cases}
$$

and

$$
z_{2i} = y_i - \bar{y}_{n+m} \quad \text{for } i = 1, 2, \ldots, n+m .
$$

Hence $Z_2 \sim MVN(0, \sigma_y^2 (I - \frac{1}{n+m} J))$, where $I$ is an
$(n+m) \times (n+m)$ identity matrix and $J$ is an $(n+m) \times (n+m)$
matrix of ones. The conditional distribution of
$b = (Z_1' Z_1)^{-1} Z_1' Z_2$ is normal with mean 0 and variance
$(Z_1' Z_1)^{-1} \sigma_y^2$, given the x's.

Further let

$$
V = (Z_2 - b Z_1)' (Z_2 - b Z_1) = Z_2' \left[ I - Z_1 (Z_1' Z_1)^{-1} Z_1' \right] Z_2 .
$$

Finally let $W$ be the product of the matrix of the quadratic form, $V$,
and the covariance matrix of $Z_2$ (apart from the factor $\sigma_y^2$).
Thus $W$ has the form

$$
W = \left[ I - Z_1 (Z_1' Z_1)^{-1} Z_1' \right] \left[ I - \tfrac{1}{n+m} J \right] .
$$

The quadratic form $V$ has a chi-square distribution if and only if $W$
is idempotent. The idempotence of $W$ can be easily shown and by
inspection the rank (trace) of $W$ is $n + m - 2$. Thus
$V / \sigma_y^2 \sim \chi^2_{n+m-2}$ (conditioned on $Z_1$). Now $b$ and
$V$ are independent if

$$
(Z_1' Z_1)^{-1} Z_1' \left[ I - \tfrac{1}{n+m} J \right] \left[ I - Z_1 (Z_1' Z_1)^{-1} Z_1' \right] = 0 .
$$

Noting that $Z_1$ and $J$ are orthogonal, the independence follows
immediately. Therefore

$$
t^* = \frac{b (Z_1' Z_1)^{1/2} / \sigma_y}{\left[ (V / \sigma_y^2) / (n+m-2) \right]^{1/2}} \sim t_{n+m-2}
$$

conditional on $Z_1$. Rewriting in terms of $Z_1$ and $Z_2$ yields

$$
t^* = r^* (n+m-2)^{1/2} / (1 - r^{*2})^{1/2} .
$$

Since the conditional null distribution of the quantity, t*, in
no way depends on $Z_1$ (the x's), it is the unconditional distribution
as well. Q.E.D.
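The idempotence and trace claims in the proof are easy to check numerically for a small case. The sketch below assumes that $W$ is the projection-type matrix $I - Z_1 (Z_1' Z_1)^{-1} Z_1' - J/(n+m)$, which is what the product of the quadratic-form matrix and the scaled covariance matrix reduces to once $Z_1' J = 0$ is used. The helper names and the x values are hypothetical.

```python
def w_matrix(x, m):
    """W = I - Z1 (Z1'Z1)^{-1} Z1' - J/(n+m) for n x's and m extra y's."""
    n = len(x)
    size = n + m
    x_bar = sum(x) / n
    # Z1 holds the centered x's followed by m zeros.
    z1 = [xi - x_bar for xi in x] + [0.0] * m
    z1z1 = sum(v * v for v in z1)
    return [[(1.0 if i == j else 0.0) - z1[i] * z1[j] / z1z1 - 1.0 / size
             for j in range(size)] for i in range(size)]

def matmul(a, b):
    """Plain square-matrix product for small matrices."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]

x = [0.3, -1.2, 2.5, 0.9]   # arbitrary illustrative x values, n = 4
m = 3                        # three extra y values, so n + m = 7
w = w_matrix(x, m)
ww = matmul(w, w)
size = len(w)
max_diff = max(abs(ww[i][j] - w[i][j])
               for i in range(size) for j in range(size))
trace = sum(w[i][i] for i in range(size))
print(max_diff, trace)
```

For this example `max_diff` should be at rounding-error level and `trace` should equal $n + m - 2$, matching the degrees of freedom stated in the theorem.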

This test is appealing in its simplicity and hence our next concern
is its efficiency. There is even some cause for optimism because of
the increased degrees of freedom. The theorem and proof above give
the null distribution of t*; however the derivation of the non-null
distribution of t* is not a simple task. For any simple alternative
hypothesis that $\rho = \rho_1 \ne 0$, the distribution of t* is not a
noncentral t except in the special case where $m = \infty$, and hence the
exact power can be calculated in this case. The analytical calculation
of the power function in cases other than $m = \infty$ is difficult.
However, a small scale sampling experiment is sufficient to demonstrate
the bad news: we would be better off to throw the extra y's away rather
than to use them as in t*.

III. Power Comparison

The power functions of these two competing procedures ($t^*_{n+m-2}$
and $t_{n-2}$) are compared using a small Monte Carlo study. David (1954)
has tabulated the distribution of r for different values of n and $\rho$.
Using her tables, the true power of the test based on r is given in
Table 1, both for comparison with the empirical power of t* and as a
validation of the simulation through its close agreement with the
empirical power of t.


TABLE 1

Power of tests of H_0: ρ = 0 vs. ρ > 0 (α = .05)
Monte Carlo Estimates Based on 2000 Samples

         True Power   -------------- Empirical Power --------------
  ρ       t(10,10)    t(10,10)    t*(10,20)    t*(10,50)    t*(10,∞)

 .00        .050        .054        .047         .050         .052
 .10        .085        .096        .082         .086         .082
 .20        .138        .146        .136         .142         .135
 .30        .215        .228        .211         .213         .210
 .45        .390        .393        .374         .367         .357
 .60        .623        .620        .587         .560         .536
 .75        .867        .860        .792         .764         .739
 .90        .993        .994        .946         .921         .897

Maximum Standard Error of Table Entries is .011


For the comparison of $t_{n-2}$ with $t^*_{n+m-2}$, 2000 samples were
generated with n x's and n+m y's. In each case, n = 10 while m
varied over the values 0, 10, 40 and $\infty$ (limiting case of known
$\mu_y$ and $\sigma_y^2$). Across columns the samples consist of the same
random samples of 10 paired observations plus possibly some additional
unpaired y's to stabilize the comparison of procedures. The
computations were done on the Univac 1100 computer at BGSU. The
random normal deviates were generated using the IMSL library
subroutine GGNOR with one of their recommended seeds.

There is no evidence that t* is more powerful than t for any
value of m for any alternative value of $\rho$. On the other hand, t
is in fact significantly more powerful than t* for the larger
($\rho \ge .60$) alternatives. Thus t* cannot be recommended over the
usual t-test which is based upon only the paired observations.
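A comparison of this kind can be re-created with a short simulation. The following is a minimal sketch under stated assumptions, not the original Univac/IMSL program: it uses Python's standard `random` module, standard bivariate normal data with unit variances, the case n = 10 and m = 10, and one-sided 5% critical values of Student t (8 and 18 degrees of freedom) copied from standard tables. The function names and the seed are hypothetical.

```python
import math
import random

# Upper 5% points of Student t from standard tables (8 and 18 d.f.).
T_CRIT = {8: 1.8595, 18: 1.7341}

def t_statistic(x, y_all):
    """t-type statistic from the n pairs and every value in y_all.

    With len(y_all) == len(x) this is the usual t; with extra y's it is t*.
    Returns (statistic, degrees of freedom).
    """
    n, total = len(x), len(y_all)
    x_bar = sum(x) / n
    y_bar = sum(y_all) / total
    # zip stops after the n paired values, so the extra y's enter only
    # through y_bar and the y sum of squares.
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y_all))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y_all)
    r = s_xy / math.sqrt(s_xx * s_yy)
    df = total - 2
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r), df

def simulate_power(rho, n=10, m=10, reps=2000, seed=12345):
    """Empirical power of t and t* for H1: rho > 0 on the same samples."""
    rng = random.Random(seed)
    noise = math.sqrt(1.0 - rho * rho)
    hits_t = hits_tstar = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rho * xi + noise * rng.gauss(0.0, 1.0) for xi in x]
        extra = [rng.gauss(0.0, 1.0) for _ in range(m)]  # unpaired y's
        t_val, df = t_statistic(x, y)
        ts_val, df_star = t_statistic(x, y + extra)
        hits_t += t_val > T_CRIT[df]
        hits_tstar += ts_val > T_CRIT[df_star]
    return hits_t / reps, hits_tstar / reps

for rho in (0.0, 0.45, 0.75):
    print(rho, simulate_power(rho))
```

The critical-value table only covers the default n and m; other sample sizes would need the corresponding t quantiles. As in the study described above, both statistics are computed on the same paired observations, so the difference between the two estimated powers is less noisy than either estimate alone.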

As $m \to \infty$, we get the smallest power for values of
$\rho \ge .45$, which may be surprising considering that as
$m \to \infty$ we say we "know" $\mu_y$ and $\sigma_y^2$. More and more
power is lost (relative to the standard paired procedure) as the true
value of $\rho$ departs from the null. The misuse of the extra y's in r*
is more and more evident as the underlying true correlation increases
and as the amount of unpaired data begins to dominate the paired data.
It should be noted that in this limiting case of "known" $\mu_y$ and
$\sigma_y^2$, t* becomes

$$
t^*(n, \infty) = \left[ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \mu_y) / \sigma_y \right]
                 \Big/ \left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{1/2}
$$

which is distributed normally with mean zero and variance one under
the null hypothesis.
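The exact power in this limiting case can be sketched by conditioning on the x's. This derivation is an illustration of one possible route and is not taken from the text. Given the x's, $t^*(n, \infty)$ is normal with mean $\rho (S_x^2)^{1/2} / \sigma_x$ and variance $1 - \rho^2$, where $S_x^2 / \sigma_x^2$ is chi-square with $n - 1$ degrees of freedom. The power is then the expectation of a normal tail probability over that chi-square distribution. The code below evaluates the expectation with Simpson's rule (a choice made here for simplicity); the function name and the integration limits are assumptions.

```python
import math
from statistics import NormalDist

def limiting_power(rho, n=10, alpha=0.05, upper=100.0, steps=4000):
    """Power of t*(n, infinity) against H1: rho > 0, for |rho| < 1.

    Integrates Phi((rho*sqrt(w) - z_crit)/sqrt(1 - rho^2)) against the
    chi-square(n - 1) density of w using Simpson's rule (steps is even).
    Intended for n >= 4, where the density vanishes at w = 0.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha)
    k = n - 1
    log_norm = (k / 2.0) * math.log(2.0) + math.lgamma(k / 2.0)
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(w):
        if w <= 0.0:
            return 0.0
        log_density = (k / 2.0 - 1.0) * math.log(w) - w / 2.0 - log_norm
        tail = nd.cdf((rho * math.sqrt(w) - z_crit) / scale)
        return tail * math.exp(log_density)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0

for rho in (0.0, 0.1, 0.6, 0.9):
    print(rho, round(limiting_power(rho), 4))
```

With n = 10 the printed values can be set against the "True Power t*(10,∞)" column of Table 2 as a consistency check on the conditioning argument.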

The poor performance of even the limiting case of r* is in line
with Wilks' (1932) estimation results of five decades ago. Thus, in
spite of the fact that r* provides an exact test for $\rho = 0$, it
represents a less efficient use of the extra y values than discarding
them.


IV. A TEST BASED ON THE MLE


The simplification available in the case of $\mu_y$ and $\sigma_y^2$
known lends itself to further examination. We focus attention on this
case because the power function has been calculated for the exact
test, t*.

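The ML estimate, $\hat{\rho}$, is given by Anderson (1957) for unknown
$\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$ and $\rho$. From this, the ML
estimate for known $\mu_y$ and $\sigma_y^2$ is easily deduced. Using the
following notation

$$
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n),
$$

$$
S_x^2 = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2
$$

and

$$
S_y^2 = \sum_{i=1}^{n} (y_i - \bar{y}_n)^2 ,
$$

solution of the likelihood equations yields

$$
\hat{\rho} = \frac{\sigma_y S_{xy}}{S_y^2}
\left[ \frac{S_x^2 S_y^2 - S_{xy}^2}{n S_y^2} + \frac{\sigma_y^2 S_{xy}^2}{S_y^4} \right]^{-1/2} .
$$

It can be shown that

$$
\frac{\hat{\rho}^2}{1 - \hat{\rho}^2} = \frac{n \sigma_y^2 S_{xy}^2}{S_y^2 (S_x^2 S_y^2 - S_{xy}^2)} .
$$

Letting $r = S_{xy} / (S_x S_y)$ yields

$$
\frac{r^2}{1 - r^2} = \frac{S_{xy}^2}{S_x^2 S_y^2 - S_{xy}^2}
$$

and hence

$$
\frac{\hat{\rho}^2}{1 - \hat{\rho}^2} = \frac{n \sigma_y^2}{S_y^2} \cdot \frac{r^2}{1 - r^2} .
$$

This implies that a statistic with some similarity to t is

$$
Z = \left( \frac{n-2}{n} \right) \left( \frac{\hat{\rho}^2}{1 - \hat{\rho}^2} \right)
  = \left[ (n-2) \frac{r^2}{1 - r^2} \right] \Big/ \left[ S_y^2 / \sigma_y^2 \right] .
$$

Now letting $V = (n-2) \, r^2 / (1 - r^2)$ and $U = S_y^2 / \sigma_y^2$
yields $Z = V/U$.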
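The identity $Z = V/U$ can be verified numerically. The sketch below assumes that the limiting-case ML estimate takes the regression form $\hat{\rho}^2 = \hat{\beta}^2 \sigma_y^2 / (\hat{\sigma}^2_{x \cdot y} + \hat{\beta}^2 \sigma_y^2)$ with $\hat{\beta} = S_{xy}/S_y^2$ and $\hat{\sigma}^2_{x \cdot y} = (S_x^2 - S_{xy}^2/S_y^2)/n$. That form is a reconstruction consistent with the displayed ratio $\hat{\rho}^2/(1-\hat{\rho}^2)$, not a quotation from Anderson (1957). The function name and the data are hypothetical.

```python
def z_two_ways(x, y, sigma_y):
    """Return Z computed from rho-hat and Z computed as V / U."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_x2 = sum((a - x_bar) ** 2 for a in x)
    s_y2 = sum((b - y_bar) ** 2 for b in y)

    # Assumed limiting-case estimates with sigma_y treated as known.
    beta_hat = s_xy / s_y2
    resid_var = (s_x2 - s_xy ** 2 / s_y2) / n
    explained = beta_hat ** 2 * sigma_y ** 2
    rho_hat_sq = explained / (resid_var + explained)
    z_from_mle = (n - 2) / n * rho_hat_sq / (1.0 - rho_hat_sq)

    # The same quantity written as a ratio V / U.
    r_sq = s_xy ** 2 / (s_x2 * s_y2)
    v = (n - 2) * r_sq / (1.0 - r_sq)
    u = s_y2 / sigma_y ** 2
    return z_from_mle, v / u

# Hypothetical illustrative data with sigma_y taken to be 1.
x = [1.2, 0.4, 2.3, 1.9, 3.1, 0.7]
y = [0.9, 0.1, 1.8, 2.2, 2.5, 1.0]
z_a, z_b = z_two_ways(x, y, sigma_y=1.0)
print(z_a, z_b)
```

The two printed numbers should agree up to rounding. Writing Z as V/U is what makes its null distribution tractable, since V and U have known distributions.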

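Note first that under $H_0$, $r^2$ is independent of
$y_1, y_2, \ldots, y_n$ [see the derivation of the conditional
distribution of r given y in Hogg and Craig (1970) or Johnson and Kotz
(1970)]. Therefore $(n-2) \, r^2 / (1 - r^2)$ is independent of
$S_y^2 / \sigma_y^2$, that is, V is independent of U. Also note that
$V \sim F(1, n-2)$ and $S_y^2 / \sigma_y^2 \sim \chi^2(n-1)$. Therefore
the p.d.f. of the random variable Z is [see Mood, Graybill and Boes
(1974), p. 187]

$$
\begin{aligned}
f_Z(z) &= \int_{-\infty}^{\infty} |u| \, f_U(u) \, f_V(zu) \, du \\
&= \int_0^{\infty} u \cdot
   \frac{(1/2)^{(n-1)/2}}{\Gamma\!\left(\frac{n-1}{2}\right)} \, u^{(n-3)/2} e^{-u/2} \cdot
   \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\Gamma\!\left(\frac{1}{2}\right) \Gamma\!\left(\frac{n-2}{2}\right)}
   \left( \frac{1}{n-2} \right)^{1/2}
   \frac{(zu)^{-1/2}}{\left[ 1 + \frac{zu}{n-2} \right]^{(n-1)/2}} \, du \\
&= \frac{2^{(1-n)/2} (n-2)^{-1/2} z^{-1/2}}{\Gamma\!\left(\frac{1}{2}\right) \Gamma\!\left(\frac{n-2}{2}\right)}
   \int_0^{\infty} \frac{u^{(n-2)/2} e^{-u/2}}{\left[ 1 + \frac{zu}{n-2} \right]^{(n-1)/2}} \, du .
\end{aligned}
$$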


The integral does not have a closed form expression. It was evaluated
using 15-point Laguerre integration and verified using Whittaker
functions.

To find the critical point we seek c such that, for a one-sided
$\alpha = 0.05$ test,

$$
P[Z > c] = 0.10
$$

and then by symmetry we have

$$
P\left[ \sqrt{n-2} \, \frac{\hat{\rho}}{\sqrt{1 - \hat{\rho}^2}} \ge \sqrt{nc} \right]
= P[Z_1 \ge \sqrt{nc}] = 0.05 ,
$$

where $Z_1$ denotes the signed statistic on the left, so that
$Z = Z_1^2 / n$.

Using a numerical search routine, c was found to be 0.47717
and hence for the statistic $Z_1$ the critical value is 2.1844 for
n = 10 and $\alpha = .05$ (one sided test). A Monte Carlo study with one
thousand samples of size n = 10 with $\rho = 0$ exhibited an empirical
type I error rate of exactly 0.05.
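The reported critical point can be sanity-checked by simulation instead of quadrature. The sketch below is not the Laguerre integration used in the paper. It draws V as the square of a Student t variate with $n-2$ degrees of freedom (which has the $F(1, n-2)$ distribution) and U as an independent chi-square variate with $n-1$ degrees of freedom, then estimates $P[V/U > c]$. The function name, replication count and seed are assumptions.

```python
import math
import random

def tail_prob_z(c, n=10, reps=100_000, seed=2024):
    """Monte Carlo estimate of P[Z > c] under H0, where Z = V / U."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # V ~ F(1, n-2): square of a Student t with n-2 degrees of freedom.
        chi_den = rng.gammavariate((n - 2) / 2.0, 2.0)
        t_val = rng.gauss(0.0, 1.0) / math.sqrt(chi_den / (n - 2))
        v = t_val * t_val
        # U ~ chi-square(n-1), drawn independently of V.
        u = rng.gammavariate((n - 1) / 2.0, 2.0)
        if v / u > c:
            hits += 1
    return hits / reps

c = 0.47717
n = 10
print(tail_prob_z(c, n))    # estimate of the two-sided tail probability
print(math.sqrt(n * c))     # implied one-sided critical value for Z_1
```

The first number estimates the two-sided probability that the text sets to 0.10, and the second reproduces the step from c to the critical value of the signed statistic.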

A Monte Carlo power study to estimate the power of this test
based on $Z_1$, along with the empirical powers of the other two
competing tests based on t and t*, is reported in Table 2. The
table also presents the true powers of t(10,10) and t*(10,∞).
There is no evidence that the test based on $Z_1$ is more powerful than
the one based on t or t* for all alternative values of $\rho$. In fact,
$Z_1$ has significantly less power for $0.1 \le \rho \le 0.5$. Figure 1
displays the power curves of the three tests. The fact that the test
based on t* is the most powerful of the three near $\rho = 0$ is not
discernible; yet even though we noted in Section III that t* was
inferior to t for large $\rho$, it can be shown that t* is a locally most
powerful invariant test [Ahmad and Giri (1979)]. It is also clear that
this local optimality is rather inconsequential in light of the price
that is paid at the larger values of $\rho$. Finally, it seems apparent
from Figure 1 that the usual t-test is preferable to either t* (LMPI)
or $Z_1$ (MLE).


TABLE 2

Power of tests of H_0: ρ = 0 vs. H_1: ρ > 0 (α = 0.05)
Monte Carlo Estimates Based on 2500 Samples

         True Power   --------- Empirical Power ---------   True Power
  ρ       t(10,10)    t(10,10)    Z_1(10,∞)    t*(10,∞)      t*(10,∞)

 .00        .050        .054        .055         .053         .050
 .10        .085        .089        .079         .096         .0874
 .20        .138        .147        .116         .153         .1418
 .30        .215        .224        .188         .218         .2153
 .40        .321        .312        .277         .307         .3087
 .45        .390        .381        .325         .368         .3624
 .50        .459        .441        .387         .415         .4202
 .60        .623        .618        .584         .536         .5450
 .75        .867        .867        .874         .739         .7385
 .90        .993        .994        .999         .905         .9033

Maximum Standard Error of Table Entries


FIGURE 1

POWER CURVES FOR TH...


V. The Generalized Likelihood Ratio Test

For completeness it is interesting to obtain the generalized
likelihood ratio statistic as a final competitor to the three tests
of the previous section. Consider again the situation in which we
have m additional observations on Y.

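Following Anderson (1957), we let

$$
\beta = \rho \sigma_x / \sigma_y , \quad
\nu = \mu_x - \beta \mu_y , \quad
\sigma^2_{x \cdot y} = \sigma_x^2 (1 - \rho^2) ,
$$

so that the likelihood factors into the marginal density of the $n+m$
values of y and the conditional densities of the $x_i$ given the $y_i$,
$i = 1, \ldots, n$. The marginal factor does not involve $\beta$, $\nu$
or $\sigma^2_{x \cdot y}$ and has the same maximum under both hypotheses.
Thus the likelihood ratio will depend only upon the two maxima of
the products of the conditional densities of the $x_i$.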


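Under the null hypothesis, $\rho = 0$, which implies that $\beta = 0$
and that the MLEs are

$$
\hat{\nu} = \bar{x}_n
$$

and

$$
\hat{\sigma}^2_{x \cdot y} = S_x^2 / n .
$$

The unconstrained maximization yields the usual estimates of regression
parameters,

$$
\hat{\beta}_{xy} = S_{xy} / S_y^2 ,
$$

$$
\hat{\nu} = \bar{x}_n - \hat{\beta}_{xy} \bar{y}_n
$$

and

$$
\hat{\sigma}^2_{x \cdot y} = \frac{1}{n} \left( S_x^2 - \hat{\beta}_{xy}^2 S_y^2 \right)
= (S_x^2 S_y^2 - S_{xy}^2) / n S_y^2 .
$$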


It follows algebraically that

$$
\lambda = \frac{L(\hat{H}_0)}{L(\hat{H}_1)} = (1 - r^2)^{n/2} .
$$


Notice that this does not depend upon the additional observations and
offers still further support for the use of the familiar t statistic.
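The algebra behind the likelihood ratio can be illustrated numerically. The sketch below evaluates the maximized conditional normal log-likelihood of the x's given the paired y's under each hypothesis, using the variance estimates $S_x^2/n$ and $(S_x^2 S_y^2 - S_{xy}^2)/(n S_y^2)$, and compares the resulting ratio with $(1 - r^2)^{n/2}$. The function name and data are hypothetical. The extra y's are deliberately absent because they affect only the marginal factor, which cancels.

```python
import math

def glr_two_ways(x, y):
    """Return (lambda from maximized likelihoods, (1 - r^2)^(n/2))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_x2 = sum((a - x_bar) ** 2 for a in x)
    s_y2 = sum((b - y_bar) ** 2 for b in y)

    var_null = s_x2 / n                                   # beta = 0
    var_full = (s_x2 * s_y2 - s_xy ** 2) / (n * s_y2)     # beta free

    def max_loglik(var):
        # Normal log-likelihood at its maximum with MLE variance `var`.
        return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

    lam = math.exp(max_loglik(var_null) - max_loglik(var_full))
    r_sq = s_xy ** 2 / (s_x2 * s_y2)
    return lam, (1.0 - r_sq) ** (n / 2.0)

# Hypothetical paired data (n = 5), not taken from the paper.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
lam, closed_form = glr_two_ways(x, y)
print(lam, closed_form)
```

Since $\lambda$ is a monotone function of $r^2$, rejecting for small $\lambda$ is equivalent to rejecting for large $|t|$, which is why the test collapses to the familiar t statistic.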


VI. Summary

An exact test on the correlation coefficient which uses all the
available data has been derived. In spite of the increase in degrees
of freedom, the test cannot be recommended over the usual t test based
only on the paired observations. The exact test is actually locally
most powerful invariant but this advantage quickly disappears for
moderately large alternative values of $\rho$. The test based on the
MLE of $\rho$ also uses all of the data but is not better than the t
test based only on the paired observations. Finally, the generalized
likelihood ratio test is seen to be equivalent to the familiar
t test; the unpaired observations are ignored. Thus in spite of the
fact that t* 1) provides an exact test, 2) has greater degrees of
freedom, 3) uses all the data and 4) is locally most powerful
invariant, all of which are desirable qualities, it is inferior
(practically speaking) to the familiar t test, which is based only
on the paired observations.